# CS250B: Modern Computer Systems Performance Profiling with PerfTools

Sang-Woo Jun



#### How To Evaluate Our Approaches?

- ☐ Say, we made a performance engineering change in our program
  - ...And performance decreased by 10%
  - O Why? Can we know?
- Many tools provide profiling capabilities
  - o gprof, OProfile, Valgrind, VTune, PIN, ...
- ☐ We will talk about perf, part of perf tools
  - Native support in the Linux kernel
  - Straightforward PMC (Performance Monitoring Counter) support

# Aside: Performance Monitoring Counters (PMC)

- ☐ Problem: How can we measure architectural events?
  - L1 cache miss rates, branch mis-predicts, total cycle count, instruction count, ...
  - No way for software to know
  - Events happen too often for software to be counting them
- ☐ Solution: PMCs (Sometimes called Hardware Performance Counters)
  - Dozens of special registers that can each be programmed to count an event
  - Privileged registers, only accessible by kernel
  - Supported PMCs differ across models and designs
- ☐ Usage
  - Program PMC, read PMC, run piece of code, read PMC, compare read values

#### Linux Perf

- ☐ Performance analysis tool in Linux
  - Natively supported by kernel
  - Supports profiling a VERY wide range of events: PMC to kernel events
  - Note: needs sudo to do most things

- ☐ Many operation modes: top, stat, record, report, ...
  - Supported events found in "sudo perf list"

```
List of pre-defined events (to be used in -e):
  branch-instructions OR branches
                                                         [Hardware event]
  branch-misses
                                                         [Hardware event
  bus-cycles
                                                         [Hardware event]
                                                         [Hardware event
  cache-misses
  cache-references
                                                         Hardware event
  cpu-cycles OR cycles
                                                         [Hardware event
  instructions
                                                         Hardware event
  ref-cvcles
                                                         <u>[Har</u>dware event
```

```
page-faults OR faults
task-clock

L1-dcache-load-misses
L1-dcache-loads
L1-dcache-stores
L1-icache-load-misses
```

#### Linux Perf: Stat

- ☐ Default command prints some useful information
  - "sudo perf stat ls"
- $\square/M$ ore events can be traced using -e
  - sudo perf stat -e task-clock,page-faults,cycles,instructions,branches,branch-misses,LLC-loads,LLC-load-misses ls

```
Performance counter stats for 'ls':
                        task-clock (msec)
                                                        0.805 CPUs utilized
         0.652008
                        context-switches
                                                       0.000 K/sec
                        cpu-migrations
                                                       0.000 K/sec
                       page-faults
              104
                                                        0.160 M/sec
        2,797,861
                        cvcles
                                                        4.291 GHz
                                                       0.80 insh per cycle
        2,245,082
                        instructions
                                                     681.119 M/sec
          444,095
                        branches
           16.749
                        branch-misses
                                                        3.77% of all branches
      0.000810402 seconds time elapsed
```

```
Performance counter stats for 'ls':
         0.681485
                       task-clock (msec)
                                                       0.781 CPUs utilized
                       page-faults
                                                       0.150 M/sec
        2,921,152
                       cycles
                                                       4.286 GHz
                                                       0.76 insn per cycle
        2,217,325
                       instructions
                       branches
          439,589
                                                     645.046 M/sec
                       branch-misses
                                                       3.78% of all branches
           16,608
            9,736
                       LLC-loads
                                                      14.286 M/sec
            3,269
                       LLC-load-misses
                                                      33.58% of all LL-cache hits
      0.000872088 seconds time elapsed
```

#### Linux Perf: Record, Report

☐ Log events with "record", interactively analyze it with "report"

[kernel.kallsyms]

- o sudo perf record -e cycles, instructions, L1-dcache-loads, L1-dcache-load-misses [...]
- Creates "perf.data"
- ☐ "sudo perf report" reads "perf.data"

```
Available samples
8 cycles
8 instructions
8 L1-dcache-loads
7 L1-dcache-load-misses
```

```
This is where
verhead Command Shared Object
                 [kernel.kallsyms]
                                    [k] memcpy erms
                                                                              most cycles are spent!
                 [kernel.kallsvms]
                                       perf_event_addr_filters_exec
        perf
                 [kernel.kallsyms]
        perf
                                    [k] native write msr
                 [kernel.kallsyms]
                                    [k] native write msr
        tail
          of event 'L1-dcache-load-misses', Event count (approx.): 36818
                                                                                 This is where
                Shared Object
                                                                                 most L1 cache misses are!
                 [kernel.kallsyms]
        perf
                                    [k] perf iterate ctx
                 [kernel.kallsyms]
                                    [k] perf event addr filters exec
        perf
                 [kernel.kallsyms]
        perf
                                    [k] native write msr
```

[k] native write msr

#### CS 250B: Modern Computer Systems

Modern Processors – Handling Branches

Sang-Woo Jun



#### What Do Conditionals Compile To?

- ☐ Conditionals (sometimes) compile to branch instructions in assembly
  - Compiler optimizations may replace branch instructions with something else
  - But not always
- ☐ Branch instructions take cycles(s)
  - At least one cycle, perhaps more
  - Obvious!

```
int bar(int v) {
   if ( v == 0 ) return 1;
   else return 0;
}
```

```
gcc, x86-64,
no optimizations
```

```
bar(int):
        push
                 rbp
                 rbp, rsp
        mov
                 DWORD PTR [rbp-4], edi
        mov
                 DWORD PTR [rbp-4], 0
        cmp
                 .L2
        jne
                 eax, 1
        mov
                 .L3
        jmp
.L2:
                 eax, 0
        mov
.L3:
                 rbp
        pop
        ret
```

Generated using GCC explorer: <a href="https://gcc.godbolt.org/">https://gcc.godbolt.org/</a>

## Remember: Pipelined Processors and Hazards

- Modern, pipelined processors handle multiple instructions at once
  - Ideally, N-stage pipeline processes N instructions at a given cycle
  - But, sometimes future instructions depend on results of earlier ones ("Hazard")
  - Many types of hazards were introduced in undergrad architecture class
- ☐ Today, we look at the impact of handling "Control hazards"



#### Handling Control Hazards

- ☐ Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
    - e.g., Still working on decode stage of branch

i1: beq s0, zero, elsewhere

i2: addi s1, s0, 1

elsewhere:

i3: addi s1, s0, 2



## Control Hazard (Partial) Solution: Branch Prediction

- ☐ Processor will try to predict whether branch is taken or not
  - o If prediction is correct, great!
    - Single cycle overhead
  - If not, we do not apply the effects of mis-predicted instructions
    - Effectively same performance penalty as stalling in this case
    - Can be many cycles of overhead depending on pipeline depth

#### Simple Branch Predictor Example



No state update before Execute stage can detect misprediction (Fetch and Decode stages don't write to register)

#### Some Classes Of Branch Predictors

- ☐ Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history (1-bit "taken" or "not taken") of each branch in a fixed size "branch history table"
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

#### Branch prediction and performance

- ☐ Effectiveness of branch predictors is crucial for performance
  - Spoilers: On SPEC benchmarks, modern predictors routinely have 98+% accuracy
  - Of course, less-optimized code may have much worse behavior
- ☐ Branch-heavy software performance depends on good match between software pattern and branch prediction
  - Some high-performance software optimized for branch predictors in target hardware
  - Or, avoid branches altogether! (Branchless code)

## Recap: Loop unrolling A Compiler Solution To Branch Hazards

```
for ( i = 0 to 15 ) foo();

Loop unrolling

foo();

foo();

foo();

foo();

foo();

foo();
```

Potentially <u>16</u> branch mispredicts Even without mispredicts, branch instruction consume <u>16</u> cycles Potentially 4 branch mis-predicts
Without mis-predicts,
branch instruction consume 4 cycles

We can do this manually, or tell the compiler to do its best

- GCC flags -funroll-loops, -funroll-all-loops
- How much to unroll depends on heuristics within compiler

#### Code Example: Counting Numbers

- ☐ How fast is the following code?
  - o a and b are initialized to rand()%256
  - o cnt is 100,000,000
  - Compiled with GCC –O3

```
for ( int i = 0; i < cnt; i++ ) {
>   if ( a[i] < 128 && b[i] < 128 ) lcnt++;
}</pre>
```

- ☐ This code takes 0.44s on my desktop (i5 @ 3 GHz)
  - Each loop takes 13.2 cycles (3 GHz \* 0.44 / 100,000,000)
  - Can we do better? My x86 is 4-way superscalar!

## Optimization Attempt #1: Loop Unrolling

- ☐ There are three potential branch instruction locations
  - o "i < cnt", "a[i] < 128", and b[i] < 128"
- ☐ Is the bottleneck the "for" loop?
  - Let's try giving -funroll-all-loops

```
for ( int i = 0; i < cnt; i++ ) {
>   if ( a[i] < 128 && b[i] < 128 ) lcnt++;
}</pre>
```

- ☐ Performance increased from 0.44s to ~0.43s.
  - Better, but not by much

#### Identifying The Bottleneck

- ☐ We predict the "if" statements are the bottlenecks
  - Each of the two branch instructions has a 50% chance of being taken
  - Branch prediction very inefficient!

```
for ( int i = 0; i < cnt; i++ ) {
> if ( a[i] < 128 && b[i] < 128 ) lcnt++;
}</pre>
```

- ☐ Performance improves when comparison becomes skewed
  - 0.44s when comparing against 128 (50%)
  - 0.27s when comparing against 64 (25%), 0.17s with 32

#### Optimization Attempt #2: Branchless Code

- ☐ Let's try getting rid of the "if" statement. How?
- ☐ Some knowledge of architectural treatment of numbers is required
  - o x86 represents negative numbers via two's complement
  - $\circ$  "1" == 0x1, "-1" == 0xffffffff
  - $\circ$  "1>>31" == 0x0, "-1>>31" == 0xffffffff
- ☐ "(v-128)>>31"
  - $\circ$  if v >= 128: 0x0
  - v < 128: 0xfffffff

So many more instructions! Will this be faster?

```
for ( int i = 0; i < cnt; i++ ) {
> lcnt += ( (((a[i] - 128)>>31)&1) * (((b[i] - 128)>>31)&1) );
}
```

#### Comparing Performance Numbers

| Name       | Elapsed<br>(s) |
|------------|----------------|
| Vanilla    | 0.44 s         |
| Branchless | 0.06 s         |



Branch predictor is almost always correct

Vanilla: Total misses: 57 M out of 3,623 M

| Overhead | Command | Shared Object     | Symbol                                |
|----------|---------|-------------------|---------------------------------------|
| 87.38%   | a.out   | a.out             | [.] main                              |
| 9.80%    | a.out   | libc-2.27.so      | [.]random                             |
| 1.48%    | a.out   | libc-2.27.so      | [.]random_r                           |
| 0.36%    | a.out   | [kernel.kallsyms] | <pre>[k]pagevec_lru_add_fn</pre>      |
| 0.29%    | a.out   | [kernel.kallsyms] | <pre>[k] get_page_from_freelist</pre> |

~2 cycles per loop! 8 Operations with 4 way superscalar...

Branchless: Total misses: 7 M out of 3,514 M

Over 7x performance!

```
Command
         Shared Object
                              Symbol Symbol
         libc-2.27.so
                                   random
a.out
         libc-2.27.so
a.out
                                    random r
                                  get page from freelist
          [kernel.kallsyms]
a.out
          [kernel.kallsyms]
                                    pagevec lru add fn
a.out
a.out
          [kernel.kallsyms]
                                    handle mm fault
                                  main
a.out
          a.out
         libc-2.27.so
a.out
                                  rand
```

Interestingly, loop with only one comparator is automatically optimized by compiler

```
for ( int i = 0; i < cnt; i++ ) {
>  if ( a[i] < 128 ) lcnt ++;
}</pre>
```

Shows same performance as the branchless one

#### Aside: Spectre (Simplified)

- Branch prediction is supposed to be transparent
- But not all aspects of it are!

```
SPECTRE
```

#### Slightly Deeper into Speculative Execution

- ☐ Branch prediction is one example of Speculative Execution
- Modern processors speculatively execute many things!
  - e.g., Does this Virtual Memory page belong to the Kernel?
  - Assume we have access to everything, and roll back state if turns out to be false
  - Always waiting for checks is too much overhead!

## Aside: Meltdown (Simplified)

- Caches are supposed to be transparent
- ☐ But not all aspects of it are!



```
int array[256];
char forbidden = (*forbidden_ptr);
int ignore = array[forbidden];
for ( int i = 0; i < 256; i++ ) is_in_cache(array[i]);</pre>
```

Assuming (\*forbidden ptr) returns zeros in the correct operation, instead of protection fault

Always being safe is too much overhead. How do we fix this?

#### Questions?

#### CS 250B: Modern Computer Systems

Modern Processors – SIMD Extensions

Sang-Woo Jun



#### Modern Processor Topics

- ☐ Transparent Performance Improvements
  - Pipelining, Caches
  - Superscalar, Out-of-Order, Branch Prediction, Speculation, ...
  - Covered in CS250A and others
- ☐ Explicit Performance Improvements
  - SIMD extensions, AES extensions, ...
  - O ...
- Non-Performance Topics
  - Virtualization extensions, secure enclaves, transactional memory, ...

## Flynn Taxonomy (1966) Recap

|                       |        | Data Stream                      |                                           |
|-----------------------|--------|----------------------------------|-------------------------------------------|
|                       |        | Single                           | Multi                                     |
| Instruction<br>Stream | Single | SISD<br>(Single-Core Processors) | SIMD<br>(GPUs, Intel SSE/AVX extensions,) |
|                       | Multi  | MISD<br>(Systolic Arrays,)       | MIMD<br>(VLIW, Parallel Computers)        |

#### Flynn Taxonomy Recap



Single-Instruction
Single-Data
(Single-Core Processors)



Single-Instruction
Multi-Data
(GPUs, SIMD Extensions)



Multi-Instruction
Single-Data
(Systolic Arrays,...)



Multi-Instruction Multi-Data (Parallel Computers)

#### Intel SIMD Extensions

- ☐ New instructions, new registers
- ☐ Introduced in phases/groups of functionality
  - SSE SSE4 (1999 –2006)
    - 128 bit width operations
  - AVX, FMA, AVX2, AVX-512 (2008 2015)
    - 256 512 bit width operations
- ☐ F16C, and more to come?

#### Ice Lake Annotated Die (2019)





## Sandy Bridge Microarchitecture (2011)



e.g., "Port 5 pressure" when code uses too much shuffle operations

## Skylake-X Microarchitecture (2019)



#### Apple M1 Microarchitecture (2020)



## Intel SIMD Registers (AVX-512)



#### SSE/AVX Data Types



Operation on 32 8-bit values in one instruction!

#### Complexity of AVX-512



### Aside: Do I Have SIMD Capabilities?

☐ less /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p se36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm con stant\_tsc art arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmp erf tsc\_known\_freq pni pclmulgdq dtes64 monitor ds\_cpl vmx est tm2 ssse3 sdbg fma cx1 6 xtpr pdcm pcid sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f 16c rdrand lahf\_lm abm 3dnowprefetch cpuid\_fault epb invpcid\_single pti ssbd ibrs ibp b stibp tpr\_shadow vnmi flexpriority ept vpid fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel\_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp\_notify hwp\_act\_window hwp\_epp flush\_l1d

# Processor Microarchitectural Effects on Power Efficiency

- ☐ The majority of power consumption of a CPU is not from the ALU
  - Cache management, data movement, decoding, and other infrastructure
  - Adding a few more ALUs should not impact power consumption
- Indeed, 4X performance via AVX does not add 4X power consumption
  - From i7 4770K measurements with matrix multiplication:
  - Idle: 40 W
  - Under load: 117 W
  - Under AVX load: 128 W

### Compiler Automatic Vectorization

☐ In gcc, flags "-O3 -mavx -mavx2" attempts automatic vectorization

.L2:

☐ Works pretty well for simple loops

```
int a[256], b[256], c[256];
void foo () {
   for (int i=0; i<256; i++) a[i] = b[i] * c[i];
}

vmovdqa xmm1, XMMWORD P
add rax, 16
vpmulld xmm0, xmm1, XMM
vmovaps XMMWORD PTR a[r
cmp rax, 1024</pre>
```

☐ But not for anything complex

o E.g., naïve bubblesort code not parallelized at all

vmovdqa xmm1, XMMWORD PTR b[rax]
add rax, 16
vpmulld xmm0, xmm1, XMMWORD PTR c[rax-16]
vmovaps XMMWORD PTR a[rax-16], xmm0
cmp rax, 1024
jne .L2

Generated using GCC explorer: <a href="https://gcc.godbolt.org/">https://gcc.godbolt.org/</a>

### Intel SIMD Intrinsics

- ☐ Use C functions instead of inline assembly to call AVX instructions
- ☐ Compiler manages registers, etc
- Intel Intrinsics Guide
  - https://software.intel.com/sites/landingpage/IntrinsicsGuide
  - One of my most-visited pages...

```
e.g.,

__m256 a, b, c;

__m256 d = _mm256_fmadd_ps(a, b, c); // d[i] = a[i]*b[i]+c[i] for i = 0 ...7
```

### Data Types in AVX/AVX2

| Туре  | Description                         |                               |
|-------|-------------------------------------|-------------------------------|
| m128  | 128-bit vector containing 4 floats  |                               |
| m128d | 128-bit vector containing 2 doubles |                               |
| m128i | 128-bit vector containing integers  | 1~16 signed/unsigned integers |
| m256  | 256-bit vector containing 8 floats  |                               |
| m256d | 256-bit vector containing 4 doubles |                               |
| m256i | 256-bit vector containing integers  | 1~32 signed/unsigned integers |

m512 variants also for AVX-512

### Intrinsic Naming Convention

```
_ mm<width>_[function]_[type]
```

E.g., \_mm256\_fmadd\_ps:
 perform fmadd (fused multiply-add) on
 256 bits of
 packed single-precision floating point values (8 of them)

| Width | Prefix  |
|-------|---------|
| 128   | _mm_    |
| 256   | _mm256_ |
| 512   | _mm512_ |

Not all permutations exist! Check guide

| Туре                    | Postfix                |  |
|-------------------------|------------------------|--|
| Single precision        | _ps                    |  |
| Double precision        | _pd                    |  |
| Packed signed integer   | _epiNNN (e.g., epi256) |  |
| Packed unsigned integer | _epuNNN (e.g., epu256) |  |
| Scalar integer          | _siNNN (e.g., si256)   |  |

### Load/Store/Initialization Operations

```
    □ Initialization

            _mm256_setzero_ps/pd/epi32/...
            _mm256_set_...
            ...

    □ Load/Store: Variants for addresses aligned/unaligned by 256-bit

            _mm256_load_..._mm256_loadu_...
            _mm256_store_..._mm256_storeu_...

    □ And many more! (Masked read/write, strided reads, etc...)
```

```
e.g.,
__mm256d t = _mm256_load_pd(double const * mem); // loads 4 double values from mem to t
__mm256i v = _mm256_set_epi32(h,g,f,e,d,c,b,a); // loads 8 integer values to v
```

### Vertical Vector Instructions

- ☐ Add/Subtract/Multiply
  - \_mm256\_add/sub/mul/div\_ps/pd/epi
    - Mul only supported for epi32/epu32/ps/pd
    - Div only supported for ps/pd
    - Consult the guide!
- ☐ Max/Min/GreaterThan/Equals
- ☐ Sqrt, Reciprocal, Shift, etc...
- ☐ FMA (Fused Multiply-Add)
  - (a\*b)+c, -(a\*b)-c, -(a\*b)+c, and other permutations!
  - Consult the guide!



```
a
        X
               X
                      X
                             X
b
C
d
```

```
__m256 a, b, c;
__m256 d = _mm256_fmadd_pd(a, b, c);
```

### Integer Multiplication Caveat

- ☐ Integer multiplication of two N bit values require 2N bits
- ☐ E.g., \_\_mm256\_mul\_epi32 and \_\_mm256\_mul\_epu32
  - Only use the lower 4 32 bit values
  - Result has 4 64 bit values
- ☐ E.g., \_\_mm256\_mullo\_epi32 and \_\_mm256\_mullo\_epu32
  - Uses all 8 32 bit values
  - Result has 8 truncated 32 bit values
- ☐ And more options!
  - Consult the guide...

### Horizontal Vector Instructions

- ☐ Horizontal add/subtraction
  - Adds adjacent pairs of values
  - E.g., \_\_m256d \_mm256\_hadd\_pd (\_\_m256d a, \_\_m256d b)



### Shuffling/Permutation

- ☐ Within 128-bit lanes
  - \_mm256\_shuffle\_ps/pd/... (a,b, imm8)
  - \_mm256\_permute\_ps/pd
  - \_mm256\_permutevar\_ps/...
- ☐ Across 128-bit lanes
  - o mm256 permute2x128/4x64 : Uses 8 bit control
  - o \_mm256\_permutevar8x32/...: Uses 256 bit control
- ☐ Not all type permutations exist for each type, but variables can be cast back and forth between types



Matt Scarpino, "Crunching Numbers with AVX and AVX2," 2016

### Blend

- ☐ Merges two vectors using a control
  - o \_mm256\_blend\_...: Uses 8 bit control
    - e.g., \_mm256\_blend\_epi32
  - o \_mm256\_blendv\_...: Uses 256 bit control
    - e.g., \_mm256\_blendv\_epi8



### Alignr

- ☐ Right-shifts concatenated value of two registers, by byte
  - Often used to implement circular shift by using two same register inputs
  - \_mm256\_alignr\_epi8 (a, b, count)

Example of 64-bit values being shifted by 8



### Helper Instructions

- ☐ Cast
  - o \_\_mm256i <-> \_\_mm256, etc...
  - Syntactic sugar -- does not spend cycles
- Convert
  - 4 floats <-> 4 doubles, etc...
- ☐ Movemask
  - \_\_mm256 mask to -> int imm8
- ☐ And many more...

# Our Current State Of Matrix Multiply: Blocked Multiplication

- ☐ Performance is best when working set fits into cache
  - But as shown, even 2048 x 2048 doesn't fit in cache
  - -> 2048 \* 2048 \* 2048 elements read from memory for matrix B
- ☐ Solution: Divide and conquer! Blocked matrix multiply
  - For block size 32 × 32 -> 2048 \* 2048 \* (2048/32) reads



C1 sub-matrix =  $A1\times B1 + A2\times B2 + A3\times B3$  ...

### Blocked Matrix Multiply Evaluations

| Benchmark    | Elapsed (s) | Normalized<br>Performance |
|--------------|-------------|---------------------------|
| Naïve        | 63.19       | 1                         |
| Transposed   | 10.39       | 6.08                      |
| Blocked (32) | 7.35        | 8.60                      |

Bottlenecked by computation

Bottlenecked by memory

Bottlenecked by processor

Bottlenecked by memory (Not scaling!)

- ☐ AVX Transposed reading from DRAM at 14.55 GB/s
  - $\circ$  2048<sup>3</sup> \* 4 (Bytes) / 2.20 (s) = 14.55 GB/s
  - 1x DDR4 2400 MHz on machine -> 18.75 GB/s peak
  - Pretty close! Considering DRAM also used for other things (OS, etc)
- ☐ Multithreaded getting 32 GB/s effective bandwidth
  - Cache effects with small chunks

### Blocked Matrix Multiply Evaluations

| Benchmark                 | Elapsed (s) | Normalized<br>Performance |
|---------------------------|-------------|---------------------------|
| Naïve                     | 63.19       | 1                         |
| Transposed                | 10.39       | 6.08                      |
| Blocked (32)              | 7.35        | 8.60                      |
| AVX Transposed            | 2.20        | 28.72                     |
| Blocked (32) AVX          | 1.50        | 42.13                     |
| 4 Thread Blocked (32) AVX | 1.09        | 57.97                     |

- ☐ Using FMA SIMD, Cache-Oblivious AVX gets 19 GFLOPS
  - Theoretical peak is 3 GHz x 8 way SIMD == 24 GFLOPS... Close!

140x performance increase compared to the baseline!

# Case Study: Sorting

- ☐ Important, fundamental application!
- ☐ Can be parallelized via divide-and-conquer
- ☐ How can SIMD help?

### Reminder: Sorting Network

- ☐ Network structure for sorting fixed number of values
- ☐ Type of a "comparator network"
  - o comparators perform compare-and-swap
- Easily pipelined and parallelized



Example 4-element sorting network 5 comparators, 3 cycle pipelined

# Reminder: Sorting Network

☐ Simple to generate correct sorting networks, but optimal structures are not well-known









Source: Wikipedia (Sorting Network)

Some known optimal sorting networks

# SIMD And Sorting Networks

- ☐ Typically, we are sorting more than one set of tuples
  - If we have multiple tasks, we can have task-level parallelism Optimized networks!
  - Sort multiple tuples at the same time
- ☐ We first need to transpose the 8 8-element variables
  - Each variable has a value for each sorting network instance
  - Non-SIMD works, or a string of unpackhi/unpacklo/blend



### SIMD And Sorting Networks

- ☐ Some SIMD instructions have high throughput, but high latency
  - Data dependency between two consecutive max instructions can take 8 cycles on Skylake
  - If each parallel stage has less than 4 operations, pipeline may stall
    - Solution: Interleave two sets of parallel 8-tuple sorting
  - In reality, min/max means even for 4-tuples, pipeline is still filled



\_\_m256d \_mm256\_max\_pd (\_\_m256d a, \_\_m256d b) Source: Instrinsics guide

#### Performance

Architecture Latency Throughput (CPI) Skylake 4 0.5 Broadwell 3 Haswell 3 Ivy Bridge 3

# The Two Register Merge

- ☐ Sort units of two pre-sorted registers, K elements
  - $\circ$  minv = A, maxv = B
  - // Repeat K times
    - minv = min(minv,maxv)
    - maxv = max(minv,maxv)
    - // circular shift one value down
    - minv = alignr(minv, minv, sizeof(int))



# SIMD And Merge Sort

- ☐ Hierarchically merged sorted subsections
- ☐ Using the SIMD merger for sorting
  - vector\_merge is the two-register sorter from before

```
aPos = bPos = outPos = 0;
vMin = va[aPos++];
vMax = vb[bPos++];
while (aPos < aEnd && bPos < bEnd) {
  /* merge vMin and vMax */
 vector merge(vMin, vMax);
  /* store the smaller vector as output*/
  vMergedArray[outPos++] = vMin;
     load next vector and advance pointer
  /* a[aPos*4] is first element of va[aPos]
  /* and b[bPos*4] is that of vb[bPos]
  if (a[aPos*4] < b[bPos*4])
      vMin = va[aPos++];
  else
      vMin = vb[bPos++];
```

Inoue et.al., "SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures," VLDB 2015

### Topic Under Active Research!

- ☐ Papers being written about...
  - Architecture-optimized matrix transposition
  - Register-level sorting algorithm
  - Merge-sort
  - o ... and more!
- ☐ Good find can accelerate your application kernel Nx

### Questions?